c(120,360,600,840)[1] 120 360 600 840
This document is meant as an addition to the Seminar Series on Data Analysis within the TRR259 in 2025. It is the companion document for the first Session. All materials can be accessed on the the github repository.
In the first part, we will repeat the basic data structures of R. This is fairly technical, but is the necessary groundwork for more interesting things to come. Feel free to refer back to it at any point. Similarly, you can find another document like this on the github called basic_R.html walking you through the very basics.
In the second part, we are exploring the tidyverse with a focus on dealing with tables and spreadsheets. We will add and subtract columns, arrange and access specific parts of the tables to get to the information that we want.
Lastly, we will generate summary tables of a fictitious study cohort using the gtsummary package.
Apart from variables (values stored in R under a certain name), there are 5 data structures in R:
vectors
factors
matrix
data frames
lists
A way to store multiple values in R is using a vector. Vectors are generated by concatenating values with the c() function.
c(120,360,600,840)[1] 120 360 600 840
Like variables, you can store vectors under a certain name.
ages <- c(120,360,600,840)These values can then be accessed using that same name again:
ages[1] 120 360 600 840
We can now use our ages vector to take a closer look at it. To see how many values are in ages, we can use the length() function. We can access any specific value in ages using square brackets “[]”. We can also access multiple values at the same time by supplying a vector of positions.
length(ages)[1] 4
ages[3][1] 600
Vectors of numbers can be generated the same way as other vectors using c(), or generated using :.
ages[c(1,2)] # accesses values in the first and second position[1] 120 360
ages[3:5] # accesses values in positions 3, 4, and 5 [1] 600 840 NA
We can also not select specific positions:
ages[-3] # accesses everything but the third position[1] 120 360 840
We can do calculations with ages all at once:
ages # just ages[1] 120 360 600 840
ages + 5 [1] 125 365 605 845
ages / 60[1] 2 6 10 14
We can compare all ages in ages to something else:
ages > 480[1] FALSE FALSE TRUE TRUE
This returns a TRUE orFALSE for every value in ages. We can aggregate them using sum(), where every TRUE is treated as 1 and every FALSE is treated as zero:
sum(ages > 480) # this answers the questions: how many values are larger than 480[1] 2
We can use ages to calculate a mean from it.
mean(ages)[1] 480
This is a specialized version of a vector oftentimes used to store categorical data. It consists of the underlying vector of values and a separate argument called levels, which represents the different values in the factor.
f <- factor(c("C", "C", "C", "A", "A", "A"),
levels = c("A", "C"))
sort(f)[1] A A A C C C
Levels: A C
f <- relevel(f, "C")
sort(f)[1] C C C A A A
Levels: C A
Matrix is a rectangular data structure. It is very good at handling very, very large amounts of data in an efficient fashion, but we will only work with it in very limited capacity. A matrix can be subset using [] for rows (before the comma) and columns (after the comma). Both rows and columns can have names and can be accessed using the names.
m <- matrix(1:20, nrow = 5)
m [,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
m[1,][1] 1 6 11 16
m[,1][1] 1 2 3 4 5
colnames(m) <- c("a", "b", "c", "d")
m a b c d
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
rownames(m) <- c("v", "w", "x", "y", "z")
m a b c d
v 1 6 11 16
w 2 7 12 17
x 3 8 13 18
y 4 9 14 19
z 5 10 15 20
m["x", "a"][1] 3
The data frame is the work horse of our analyses. Like the matrix, it is a rectangular table, has rows and columns and is used to store most of the data we care about.
Similarly to the matrix, a data frame can be subset using []. In addition, columns can be accessed using the $ operator. The summary() function can give a brief overview of a data frame.
set.seed(259)
df <- data.frame(diagnosis = rep(c("CTRL", "AAA"), each = 5),
diameter = c(rnorm(5, 5, 0.5),
rnorm(5, 2, 0.1)),
age = round(runif(10, 45, 80)))
dfdf[,1] [1] "CTRL" "CTRL" "CTRL" "CTRL" "CTRL" "AAA" "AAA" "AAA" "AAA" "AAA"
df$diameter [1] 4.836350 5.244877 5.076563 4.788480 5.806510 1.950297 2.057933 1.955191
[9] 1.833386 1.952512
summary(df) diagnosis diameter age
Length:10 Min. :1.833 Min. :46.00
Class :character 1st Qu.:1.953 1st Qu.:55.25
Mode :character Median :3.423 Median :61.50
Mean :3.550 Mean :64.10
3rd Qu.:5.017 3rd Qu.:77.00
Max. :5.807 Max. :80.00
df$diagnosis <- as.factor(df$diagnosis)
summary(df) diagnosis diameter age
AAA :5 Min. :1.833 Min. :46.00
CTRL:5 1st Qu.:1.953 1st Qu.:55.25
Median :3.423 Median :61.50
Mean :3.550 Mean :64.10
3rd Qu.:5.017 3rd Qu.:77.00
Max. :5.807 Max. :80.00
Lists are a way to store diverse types of data. It will become more important in the second session, but here you are introduced to an example.
Lists can be subset using [[]]. Note that this is a double bracket. Single brackets return a list, not the elements of a list. A list has a defined length() like a vector and can be summarized with the str() function (str standing for structure).
examples <- list("a", c("c", "example"), df)
examples[[1]][1] "a"
examples[1:2][[1]]
[1] "a"
[[2]]
[1] "c" "example"
length(examples)[1] 3
str(examples)List of 3
$ : chr "a"
$ : chr [1:2] "c" "example"
$ :'data.frame': 10 obs. of 3 variables:
..$ diagnosis: Factor w/ 2 levels "AAA","CTRL": 2 2 2 2 2 1 1 1 1 1
..$ diameter : num [1:10] 4.84 5.24 5.08 4.79 5.81 ...
..$ age : num [1:10] 50 54 79 79 59 80 71 46 63 60
Lists can also have each individual element named and can be accessed using the names.
ducks <- list(number_of_ducks = 3,
duck_names = c("Tick", "Trick", "Track"),
duck_info = df)
ducks[[1]][1] 3
ducks[["duck_info"]]ducks$duck_names[1] "Tick" "Trick" "Track"
Throughout this session and this course, we will be making extensive use of the tidyverse family of functions. If you have not yet done so, you can install them using the install.packages() command as explained in the the Basic R document.
install.packages("tidyverse")We will, in this session and the next one, go through the packages and functions. All of the packages and their functions can be accessed with the library() function:
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.2
✔ ggplot2 3.4.4 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The packages contained in the tidyverse are created for the handling of data, with a focus on consistency and legibility, as a form of grammar for data. To explore their potential, we will need some data. In the github for this course, you will find a folder called data, containing two subfolders with data in them. Both contain their own data sets:
The aneurysm survey is a small example data set, generated by me, containing 900 patients, their aortic diameter in 2 positions as well as some additional information on the patients, including sex, age and potential co-morbidities.
The expression data is taken from the airway R package, containing RNA Seq expression data from airway smooth muscle cells with and without steroid treatment. It will come to heavy use later in the course, but will also make an appearance in this session. The data set is split into two files, one containing the actual expression data for each sample and one annotation detailing which condition each sample is associated with.
The first functions we’re going to use are form the readr package, reading in our example data. This includes the path to the files. If you downloaded the complete folder from the github, you can put it in your working directory and use the same path as below, but the specifics might vary for your system. We read in the data using the read_tsv() function, as it is a tab-separated txt file, a very common file format.
library(tidyverse)
aa_data <- read_tsv("data/aneurysm_survey/example_data.txt")
aa_dataRows: 900 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): diagnosis
dbl (8): abdominal_diameter, thoracic_diameter, age, male, hypertension, cad...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In the folder, the same data is contained multiple times in different formats. All these formats can be used like below.
Since we are in Germany, the .csv format is a bit of a complicated issue: the German convention is to use a comma “,” for the decimal point, making the use of a comma-separated file difficult. For this purpose, both the read_csv() and read_csv2() functions exist, the latter accommodating the German speciality. Personally, I avoid csv files whenever I can, for precisely the reasons exemplified here. Similarly, you can store data in excel files, which can be very convenient, but comes with some issues.
Excel can modify the actual data, overwriting what was present and has a history of for example converting gene names to dates. It is generally unavoidable as a software, especially when working with clinicians, but you should not store your data in an excel format if you can avoid it at all. Part of the tidyverse, but not readr, is the read_xlsx() function from the readxl package. We can access it without using the library() command using the package::function() shorthand. This is recommended when only single functions from a package are being accessed.
All of the the following code generates the same data frame as the example above.
aa_data <- read_csv("data/aneurysm_survey/example_data.csv")
aa_data <- read_csv2("data/aneurysm_survey/example_data2.csv")
aa_data <- readxl::read_xlsx("data/aneurysm_survey/example_data.xlsx")Now that we have our data, we can take a quick look using the glimpse() function.
library(tidyverse)
glimpse(df)Rows: 10
Columns: 3
$ diagnosis <fct> CTRL, CTRL, CTRL, CTRL, CTRL, AAA, AAA, AAA, AAA, AAA
$ diameter <dbl> 4.836350, 5.244877, 5.076563, 4.788480, 5.806510, 1.950297, …
$ age <dbl> 50, 54, 79, 79, 59, 80, 71, 46, 63, 60
We see that the data consists of 9 columns and 900 rows. We also learn the name of the columns and some examples of values.
With data and a first idea what to expect, we can dive into the first group of functions from a major tidyverse package, dplyr. We will take a look on the following functions:
mutate() to create new columns
selecet() to select variables based on their names or position
filer() to filter our observations based on their values
summarise() to generate summary statistics based on multiple values
We will start by making a copy of our data, since we want to keep the original intact.
df <- aa_dataWe learned above that we have 900 patients, but no identifier for the patients, so using mutate(), we can add one. Here we use the paste() function to combine “Pat” simply with a counter from 1 to 900, and save it in the patient variable.
df <- mutate(df, patient = paste("Pat", 1:900, sep = "_"))Let’s take a look:
glimpse(df)Rows: 900
Columns: 10
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CT…
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2…
$ age <dbl> 64, 49, 55, 52, 52, 69, 74, 47, 52, 76, 48, 71, 61,…
$ male <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cad <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, …
$ bmi <dbl> 25.20, 22.95, 23.34, 22.26, 23.07, 23.57, 24.23, 22…
$ patient <chr> "Pat_1", "Pat_2", "Pat_3", "Pat_4", "Pat_5", "Pat_6…
The new column has been added to the end. If we want it to instead be at the front, we can do the following:
df <- aa_data # fresh start
df <- mutate(df, patient = paste("Pat", 1:900, sep = "_"), .before = 1) #the .before specifies the position
glimpse(df) # taking a lookRows: 900
Columns: 10
$ patient <chr> "Pat_1", "Pat_2", "Pat_3", "Pat_4", "Pat_5", "Pat_6…
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CT…
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2…
$ age <dbl> 64, 49, 55, 52, 52, 69, 74, 47, 52, 76, 48, 71, 61,…
$ male <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cad <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, …
$ bmi <dbl> 25.20, 22.95, 23.34, 22.26, 23.07, 23.57, 24.23, 22…
We can also do some computations for the column we generate. We have an age column, and a brief inspection reveals that it probably is age in years. If we want months instead, we can just calculate this using the age column.
df <- mutate(df, age_months = age * 12)
glimpse(df)Rows: 900
Columns: 11
$ patient <chr> "Pat_1", "Pat_2", "Pat_3", "Pat_4", "Pat_5", "Pat_6…
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CT…
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2…
$ age <dbl> 64, 49, 55, 52, 52, 69, 74, 47, 52, 76, 48, 71, 61,…
$ male <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cad <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, …
$ bmi <dbl> 25.20, 22.95, 23.34, 22.26, 23.07, 23.57, 24.23, 22…
$ age_months <dbl> 768, 588, 660, 624, 624, 828, 888, 564, 624, 912, 5…
Things can also be more complicated: we can classify our patients based on their abdominal aortic diameter. Here, we’ll use the ifelse() function. When you encounter something new, you can always run a question mark followed by the function in question, in this case ?ifelse() to get some more information.
df <- mutate(df,
severe_AAA = ifelse(abdominal_diameter > 5, # if an observation has an abdominal diameter larger than 5
"severe", # the new variable will classify them as severe
"non_severe")) # if not, as non_severe
glimpse(df)Rows: 900
Columns: 12
$ patient <chr> "Pat_1", "Pat_2", "Pat_3", "Pat_4", "Pat_5", "Pat_6…
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CT…
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2…
$ age <dbl> 64, 49, 55, 52, 52, 69, 74, 47, 52, 76, 48, 71, 61,…
$ male <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cad <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, …
$ bmi <dbl> 25.20, 22.95, 23.34, 22.26, 23.07, 23.57, 24.23, 22…
$ age_months <dbl> 768, 588, 660, 624, 624, 828, 888, 564, 624, 912, 5…
$ severe_AAA <chr> "non_severe", "non_severe", "non_severe", "non_seve…
You can also do calculations with multiple columns
df <- mutate(df, diameter_ratio = abdominal_diameter / thoracic_diameter)
glimpse(df)Rows: 900
Columns: 13
$ patient <chr> "Pat_1", "Pat_2", "Pat_3", "Pat_4", "Pat_5", "Pat_6…
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CT…
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2…
$ age <dbl> 64, 49, 55, 52, 52, 69, 74, 47, 52, 76, 48, 71, 61,…
$ male <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cad <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, …
$ bmi <dbl> 25.20, 22.95, 23.34, 22.26, 23.07, 23.57, 24.23, 22…
$ age_months <dbl> 768, 588, 660, 624, 624, 828, 888, 564, 624, 912, 5…
$ severe_AAA <chr> "non_severe", "non_severe", "non_severe", "non_seve…
$ diameter_ratio <dbl> 0.9925326, 1.0086654, 0.9530144, 1.0958731, 1.22850…
When creating columns, you can overwrite existing columns by creating a “new” column with the same name. Let’s say we want to flip that ratio:
df <- mutate(df, diameter_ratio = thoracic_diameter / abdominal_diameter)
glimpse(df)Rows: 900
Columns: 13
$ patient <chr> "Pat_1", "Pat_2", "Pat_3", "Pat_4", "Pat_5", "Pat_6…
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CT…
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2…
$ age <dbl> 64, 49, 55, 52, 52, 69, 74, 47, 52, 76, 48, 71, 61,…
$ male <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cad <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, …
$ bmi <dbl> 25.20, 22.95, 23.34, 22.26, 23.07, 23.57, 24.23, 22…
$ age_months <dbl> 768, 588, 660, 624, 624, 828, 888, 564, 624, 912, 5…
$ severe_AAA <chr> "non_severe", "non_severe", "non_severe", "non_seve…
$ diameter_ratio <dbl> 1.0075236, 0.9914090, 1.0493021, 0.9125144, 0.81399…
We can also round the results. For that, we can specify during creation, or just refer to the already generated column
# these are equal
df <- mutate(df, diameter_ratio = round(diameter_ratio, 2))
df <- mutate(df, diameter_ratio = round(thoracic_diameter / abdominal_diameter, 2))
glimpse(df)Rows: 900
Columns: 13
$ patient <chr> "Pat_1", "Pat_2", "Pat_3", "Pat_4", "Pat_5", "Pat_6…
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CT…
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2…
$ age <dbl> 64, 49, 55, 52, 52, 69, 74, 47, 52, 76, 48, 71, 61,…
$ male <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cad <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, …
$ bmi <dbl> 25.20, 22.95, 23.34, 22.26, 23.07, 23.57, 24.23, 22…
$ age_months <dbl> 768, 588, 660, 624, 624, 828, 888, 564, 624, 912, 5…
$ severe_AAA <chr> "non_severe", "non_severe", "non_severe", "non_seve…
$ diameter_ratio <dbl> 1.01, 0.99, 1.05, 0.91, 0.81, 1.02, 0.94, 0.96, 0.8…
Most of the time, not all observations (or patients) are relevant for the question at hand. To separate the ones that we care about for those we do not, we’re going to use filter(). With our data frame, we can use conditions to get to the relevant observations. Let’s say we only want patients who have been diagnosed with AAA:
df <- aa_data
df <- filter(df, diagnosis == "AAA") # == is a comparison, = is assignment
glimpse(df)Rows: 300
Columns: 9
$ diagnosis <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "A…
$ abdominal_diameter <dbl> 3.535668, 3.897484, 4.831148, 4.427527, 3.206244, 4…
$ thoracic_diameter <dbl> 2.096804, 1.998180, 2.024948, 2.080301, 2.125109, 2…
$ age <dbl> 74, 80, 55, 68, 56, 59, 61, 51, 46, 53, 78, 54, 61,…
$ male <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, …
$ hypertension <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, …
$ cad <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, …
$ artheriosclerosis <dbl> 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, …
$ bmi <dbl> 21.63, 32.45, 29.95, 28.60, 19.83, 28.38, 25.00, 30…
What if we only want those with a truly wide aorta, lets say larger than 4.5 cm:
df <- filter(df, abdominal_diameter > 4.5)
glimpse(df)Rows: 87
Columns: 9
$ diagnosis <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "A…
$ abdominal_diameter <dbl> 4.831148, 5.101495, 5.712208, 4.894991, 4.718541, 4…
$ thoracic_diameter <dbl> 2.024948, 2.123469, 2.109896, 1.928095, 2.019829, 1…
$ age <dbl> 55, 51, 78, 61, 72, 64, 69, 63, 56, 52, 59, 74, 79,…
$ male <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, …
$ hypertension <dbl> 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, …
$ cad <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ artheriosclerosis <dbl> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, …
$ bmi <dbl> 29.95, 30.40, 35.38, 20.21, 37.81, 29.49, 29.37, 23…
We can see, that only 87 observations remain.
These filtering steps can also be more complicated. If we want triple A patients who will be recommended surgery, but since it is an experimental procedure we want to exclude patients above the age of 75, we can also do that:
df <- aa_data # fresh start
df <- filter(df, age < 75, # patients younger that 75
abdominal_diameter > 4.5) # with an abdominal diameter larger than 4.5
glimpse(df)Rows: 72
Columns: 9
$ diagnosis <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "A…
$ abdominal_diameter <dbl> 4.831148, 5.101495, 4.894991, 4.718541, 4.697727, 4…
$ thoracic_diameter <dbl> 2.024948, 2.123469, 1.928095, 2.019829, 1.872085, 1…
$ age <dbl> 55, 51, 61, 72, 64, 69, 63, 56, 52, 59, 74, 60, 67,…
$ male <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, …
$ hypertension <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, …
$ cad <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
$ bmi <dbl> 29.95, 30.40, 20.21, 37.81, 29.49, 29.37, 23.91, 41…
We can also use the | (read: or) operator to get to all patients with a large diameter, independent of location:
df <- aa_data
df <- filter(df, abdominal_diameter > 4 | thoracic_diameter > 4)
glimpse(df)Rows: 294
Columns: 9
$ diagnosis <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "A…
$ abdominal_diameter <dbl> 4.831148, 4.427527, 4.247534, 5.101495, 5.712208, 4…
$ thoracic_diameter <dbl> 2.024948, 2.080301, 2.092148, 2.123469, 2.109896, 1…
$ age <dbl> 55, 68, 59, 51, 78, 61, 53, 55, 56, 72, 71, 51, 64,…
$ male <dbl> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, …
$ hypertension <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, …
$ cad <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ artheriosclerosis <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ bmi <dbl> 29.95, 28.60, 28.38, 30.40, 35.38, 20.21, 29.88, 30…
Only male patients, with coronary artery disease (cad) and without hypertension:
df <- aa_data
df <- filter(df, male == 1, hypertension != 1, cad == 1)
glimpse(df)Rows: 42
Columns: 9
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "AA…
$ abdominal_diameter <dbl> 1.799354, 1.895594, 2.078589, 2.068296, 2.076677, 2…
$ thoracic_diameter <dbl> 1.641936, 2.270872, 1.918303, 1.839884, 2.221215, 2…
$ age <dbl> 52, 71, 68, 56, 49, 58, 53, 62, 67, 76, 50, 60, 48,…
$ male <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cad <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ artheriosclerosis <dbl> 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, …
$ bmi <dbl> 22.26, 22.64, 22.76, 24.46, 22.65, 20.88, 29.88, 29…
In a query like above, it is sometimes of interest to know how many observations belong to each group. For that, we can use the count() function, while specifying what we should count.
df <- aa_data
df <- filter(df, male == 1, hypertension != 1, cad == 1)
count(df, diagnosis)We can see that the combination of factors is not equally distributed across the diagnoses. This is not a statistical test, but can be nice to know.
We can also filter for specific patients. In this example, this might be non-sensical, as we are also adding in the patient information, but it is easy to imagine a scenario where a query like this might be useful.
df <- aa_data
df <- mutate(df, patient = paste("Pat", 1:900, sep = "_"))
df <- filter(df, patient %in% c("Pat_32", "Pat_565"))
glimpse(df)Rows: 2
Columns: 10
$ diagnosis <chr> "CTRL", "AAA"
$ abdominal_diameter <dbl> 1.999033, 4.507430
$ thoracic_diameter <dbl> 2.195074, 2.112819
$ age <dbl> 57, 56
$ male <dbl> 0, 0
$ hypertension <dbl> 1, 1
$ cad <dbl> 0, 0
$ artheriosclerosis <dbl> 1, 1
$ bmi <dbl> 20.20, 31.57
$ patient <chr> "Pat_32", "Pat_565"
We covered how you can filter the observation based on their values, but it is rare that all the variables are of interest. You can select a subset of columns using select().
We can specify the columns that we want to retain.
df <- aa_data
df <- select(df, diagnosis, age, abdominal_diameter)
glimpse(df)Rows: 900
Columns: 3
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CT…
$ age <dbl> 64, 49, 55, 52, 52, 69, 74, 47, 52, 76, 48, 71, 61,…
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
We can also specify which columns we do not want.
df <- aa_data
df <- select(df, -age)
glimpse(df)Rows: 900
Columns: 8
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CT…
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2…
$ male <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cad <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, …
$ bmi <dbl> 25.20, 22.95, 23.34, 22.26, 23.07, 23.57, 24.23, 22…
We do not need to refer to the columns by name, position is also allowed. This also works with negative selection.
glimpse(df)Rows: 900
Columns: 8
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CT…
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2…
$ male <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cad <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, …
$ bmi <dbl> 25.20, 22.95, 23.34, 22.26, 23.07, 23.57, 24.23, 22…
df <- select(df, 1:3)
glimpse(df)Rows: 900
Columns: 3
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CT…
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2…
df <- select(df, -c(1:2)) # note that you do need the c() here, otherwise it will generate an error
glimpse(df)Rows: 900
Columns: 1
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2.…
You can also use a column type (class) together with the where() function. This uses where() and is.character() to select only the columns of class character.
df <- aa_data
df <- select(df, where(is.character))
glimpse(df)Rows: 900
Columns: 1
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTR…
Also works with numeric columns:
df <- aa_data
df <- select(df, where(is.numeric))
glimpse(df)Rows: 900
Columns: 8
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2…
$ age <dbl> 64, 49, 55, 52, 52, 69, 74, 47, 52, 76, 48, 71, 61,…
$ male <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cad <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, …
$ bmi <dbl> 25.20, 22.95, 23.34, 22.26, 23.07, 23.57, 24.23, 22…
These can be combined as desired
df <- aa_data
df <- select(df, diagnosis, where(is.numeric))
glimpse(df)Rows: 900
Columns: 9
$ diagnosis <chr> "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CTRL", "CT…
$ abdominal_diameter <dbl> 1.903918, 2.050726, 1.780361, 1.799354, 1.950962, 2…
$ thoracic_diameter <dbl> 1.918242, 2.033109, 1.868136, 1.641936, 1.588081, 2…
$ age <dbl> 64, 49, 55, 52, 52, 69, 74, 47, 52, 76, 48, 71, 61,…
$ male <dbl> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cad <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, …
$ bmi <dbl> 25.20, 22.95, 23.34, 22.26, 23.07, 23.57, 24.23, 22…
Writing code can fairly easily lead to a convoluted mess. The code below does a couple of things, none of which are particularly complicated. It filters the data frame on only AAA patients, adds a new variable using the diameter, then filters the data set based on the new variable as well as age.
df <- aa_data
df <- filter(mutate(filter(df, diagnosis == "AAA"),
operation = ifelse(abdominal_diameter > 4.5,
"operation",
"monitor")),
operation == "operation", age < 75)
glimpse(df)Rows: 72
Columns: 10
$ diagnosis <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "A…
$ abdominal_diameter <dbl> 4.831148, 5.101495, 4.894991, 4.718541, 4.697727, 4…
$ thoracic_diameter <dbl> 2.024948, 2.123469, 1.928095, 2.019829, 1.872085, 1…
$ age <dbl> 55, 51, 61, 72, 64, 69, 63, 56, 52, 59, 74, 60, 67,…
$ male <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, …
$ hypertension <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, …
$ cad <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
$ bmi <dbl> 29.95, 30.40, 20.21, 37.81, 29.49, 29.37, 23.91, 41…
$ operation <chr> "operation", "operation", "operation", "operation",…
The code below does the same thing.
df <- aa_data %>%
filter(diagnosis == "AAA") %>%
mutate(operation = ifelse(abdominal_diameter > 4.5,
"operation",
"monitor")) %>%
filter(operation == "operation", age < 75)
glimpse(df)Rows: 72
Columns: 10
$ diagnosis <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "A…
$ abdominal_diameter <dbl> 4.831148, 5.101495, 4.894991, 4.718541, 4.697727, 4…
$ thoracic_diameter <dbl> 2.024948, 2.123469, 1.928095, 2.019829, 1.872085, 1…
$ age <dbl> 55, 51, 61, 72, 64, 69, 63, 56, 52, 59, 74, 60, 67,…
$ male <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, …
$ hypertension <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, …
$ cad <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
$ artheriosclerosis <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
$ bmi <dbl> 29.95, 30.40, 20.21, 37.81, 29.49, 29.37, 23.91, 41…
$ operation <chr> "operation", "operation", "operation", "operation",…
This achieved is through use of the so called pipe, which can be typed as %>% or |>. If you are using RStudio, Ctrl + Shift + M inserts it (Cmd + Shift + M on MacOS). It takes the output of a line of code and inserts it as the input to the next line of code. The following 2 lines are equivalent.
select(df, diagnosis)
df %>% select(diagnosis)This works particularly well in the tidyverse, where the first input to functions is always the data.
This concept is both popular and controversial. Purists will tell you that people piping everything will never learn to code properly and evangelists will say that codes without pipes is needlessly difficult to read and understand. Your mileage may vary. An important thing to note is that piping is inherently inefficient from a resources point of view. Please be careful when using the pipe in the context of millions of rows and/or columns.
We have now learned how to add columns, filter observations and select the variables of interest. We’ll keep on using these tools throughout the next sessions as the backbone of everything we do. Before we move on from the dplyr family, we still have to meet two more workhorses of the package: summarise() and groupy_by().
summarise() allows you to perform summary statistics and other computations taking in more than one value. If, for example, you want to calculate the mean or median of a certain variable, you would use summarise().
aa_data %>%
summarise(mean_abdominal_diameter = mean(abdominal_diameter))aa_data %>%
summarise(median_age = median(age),
mean_age = mean(age),
sd_age = sd(age))On their own, these calculations are fairly simple and easily doable in both base R and something like Excel, but it gets more interesting once you add group_by(). This allows you to create groups within your data and to calculate the statistics for each group separately. In the example below, we calculate the mean for each diagnosis separately.
aa_data %>%
group_by(diagnosis) %>%
summarise(mean_abdominal_diameter = mean(abdominal_diameter))This applies to multiple calculations as well:
aa_data %>%
group_by(diagnosis) %>%
summarise(median_age = median(age),
mean_age = mean(age),
sd_age = sd(age))We can also group by more complicated things like cutoffs for variables. If we want to know if the mean BMI between AAA patients with an aneurysm larger than 4.5 is different from those with a smaller diameter, we can do so
aa_data %>%
filter(diagnosis == "AAA") %>% # filter for AA patients only
group_by(abdominal_diameter > 4.5) %>% # group by the relevant cutoff
summarise(mean_bmi = mean(bmi),
sd_bmi = sd(bmi))Sometimes, you want to know, what the highest values or lowest values in your data set are, or you want to subset on the top 10 based on some metric. For this purpose, we have three tools: arrange() to reorder the rows of your data set based on a column of choice, slice_max() filters out the top X rows based on a column of choice, and slice_min() does the opposite.
By default, arrange() sorts ascending. So if we want the data frame arranged by the abdominal diameter, we can run:
aa_data %>%
arrange(abdominal_diameter) %>%
head(10)To change this, simply attach a “-” before the variable of choice:
aa_data %>%
arrange(-abdominal_diameter) %>%
head(10)If we only want a subset of rows, we instead can use slice_min() or slice_max(). To use it, we supply the data frame and available we want to order by, as well as N, the number of rows we want.
aa_data %>%
slice_max(abdominal_diameter, n = 3)aa_data %>%
slice_min(abdominal_diameter, n = 3)Once we have the basics down, we can move on to more complicated operations. For that, we will start by reading in the other example data, namely an expression data set:
library(tidyverse)
counts <- read_tsv(here::here("data/expression_data/counts.txt"))Rows: 53563 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): gene
dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516, SRR1039...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
annotation <- read_tsv(here::here("data/expression_data/annotation.txt"))Rows: 8 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
dbl (1): avgLength
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(counts)annotationIn the counts, we have the first column as the gene symbol, and the other columns as the different samples. Our samples are further described in the annotation, where we find the sample name in the Run variable. In order to combine this information, we will need to do some data reshaping: we want to have all the samples as a column, so that for each row we can add the relevant group annotation (for our case here whether or not the sample was treated with dexamethasone, save in the column dex in annotation). For that, we will need two things: pivot_longer() and inner_join().
We will start with counts and pivot_longer(). This reshapes from its current wide format to a longer format: we lose columns and gain rows, without losing any information.
head(counts) # quick look at how the data lookscounts_long <- counts %>% pivot_longer( # take the data and pivot longer
cols = -gene, # select the columns to pivot longer, here everything but the gene column (similar to select)
names_to = "sample", # put the column names into a new column called sample
values_to = "count" # put the corresponding values ion a new column called count
)
head(counts_long)The data frame still contains all the original information, just the organisation has changed.
Taking this data frame, we can now add the annotation information to it. This information currently is stored in annotation. We will first select the relevant columns from annotation, namely Run containing the sample name, dex containing the dexamethasone information, and cell, indicating the cell line of origin.
anno_filtered <- annotation %>%
select(Run, dex, cell)
anno_filteredWith this new annotation, we will now merge it into the expression data. We will use inner_join(), a function from the dplyr package. It creates new columns with the data given, sorting it into an existing data frame based on the by argument as supplied. For more information, check out ?inner_join().
df_long <- counts_long %>%
inner_join(anno_filtered, # join in the filtered annotation
by = join_by(sample == Run)) # join it, using the sample information from the counts and the Run variable from the annotation
head(df_long)We now have a data frame that contains all the information we want/need in df_long. Using the knowledge from the previous section, we can calculate the mean counts for a gene of interest (note: this is not how to handle RNA Seq data, but is done here for demonstration purposes).
df_long %>%
filter(gene == "IL6") %>%
group_by(dex) %>%
summarise(mean_expression = mean(count))With our data in long format, we can do various things as demonstrated above. We can also re-transform the data back into wide format; this does not necessarily mean the same shape and organisation as before. Here, we transform the data in such a way that we keep the samples and associated sample information as rows and instead move gene names to the columns.
df_wide <- df_long %>%
pivot_wider(
names_from = "gene",
values_from = "count"
)
head(df_wide)The important thing for this kind of transformation is that each observation and each value is always distinctive, ie.´you can always tell from which sample and variable a value is derived.
When we have our data in shape and organisation that we want to, we can do some interesting stuff with it. If we for example take the long_df we used in the previous section (I will regenerate it here for legibility):
library(tidyverse)
counts <- read_tsv(here::here("data/expression_data/counts.txt"))Rows: 53563 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): gene
dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516, SRR1039...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
annotation <- read_tsv(here::here("data/expression_data/annotation.txt"))Rows: 8 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
dbl (1): avgLength
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
counts_long <- counts %>% pivot_longer( # take the data and pivot longer
cols = -gene, # select the columns to pivot longer, here everything but the gene column (similar to select)
names_to = "sample", # put the column names into a new column called sample
values_to = "count" # put the corresponding values ion a new column called count
)
anno_filtered <- annotation %>%
select(Run, dex, cell)
df_long <- counts_long %>%
inner_join(anno_filtered, # join in the filtered annotation
by = join_by(sample == Run))With this data we look into genes in the data set. Please note again, that this is not a blueprint for RNA-Seq analysis, it is used as a dataset with a large amount of diversely named variables, namely genes.
Here we use str_detect() from the stringr package. It takes a string (computer speak for writing) as input and detects if a certain pattern is present
library(tidyverse)
the_string <- "Hello, this is some writing"
str_detect(the_string, "Hello") # this will be true, since "Hello" is part of the string[1] TRUE
str_detect(the_string, "Howdy") # this will be fales[1] FALSE
we can combine this with filter() to filter out genes showing a consistent naming pattern, like mitochondrial genes. The genes start with “MT-”, so we can use that as a pattern to only look at these genes, here combined with group_by() so that we get a mean per gene and treatment condition.
df_long %>%
filter(str_detect(gene, "MT-")) %>%
group_by(gene, dex) %>%
summarise(mean_count = mean(count))`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
Similarly, we can use a simliar pattern in the interleukins, a family of immune mediators. They tend to start with “IL”:
df_long %>%
filter(str_detect(gene, "IL")) %>%
group_by(gene, dex) %>%
summarise(mean_count = mean(count))`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
If you look at the genes being filtered here, the pattern was insufficient, picking up genes like AVIL or EMILIN1. Using a special pattern indicating the start of the string, known as a regular expression or regex, we can make the pattern a lot more specific. This is more advanced stuff, but you can find an overview of various regex patterns here.
df_long %>%
filter(str_detect(gene, "^IL")) %>% # the ^ indicates the start if the string
group_by(gene, dex) %>%
summarise(mean_count = mean(count))`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
While not perfect, this approach excludes a lot more undesired genes.
Paging through the results above, you will notice genes containing the “-AS” pattern - this indicates antisense RNAs, not protein coding genes. You can exclude them using a similar approach. Here, the ! in front of the str_detect() negates the pattern, excluding the results.
df_long %>%
filter(str_detect(gene, "^IL"),
!str_detect(gene, "-AS")) %>%
group_by(gene, dex) %>%
summarise(mean_count = mean(count))`summarise()` has grouped output by 'gene'. You can override using the
`.groups` argument.
str_detect() can be very useful for filtering and the like, but there are two more workhorses in the stringr package: str_replace() and its special form, str_remove().
This will be a lot more pressing once we get to plotting with its labels and text, but lets say we want to change the values for the dex column in annotation. We can use the str_replace() function for this. Checking out ?str_replace(), we can learn that it expects 3 things: a string, a pattern and replacement; it then does what it says on the package, it replaces the pattern with the replacement. In the example below, we replace every occurence of “trt” in the dex column with “treated”.
anno_filteredanno_filtered %>%
mutate(dex = str_replace(dex, "trt", "treated"))If we want it a little more pretty, we can also add a “dexamethasone” to the start using paste():
df_long2 <- df_long %>%
mutate(dex = str_replace(dex, "trt", "treated"),
dex = paste("dexamethasone", dex))
head(df_long2)That seems a little long, and almost everyone understands it when you say “dexa”, so lets trim it back a bit.
df_long2 %>%
mutate(dex = str_remove(dex, "methasone")) %>%
head()Another quick example: in our aneurysm survey, we have the diagnosis column with three values: CTRL, TAA, AAA.
aa_data %>%
count(diagnosis)Working in Germany, sometimes we have to bow to the German nomenclature in the clinic, where instead of abdominal aortic aneurysm it can be referred to as BAA, Bauch (meaning abdomen) aortic aneurysm. If we want to apply this to the data set, we can use str_replace() to do so.
aa_data %>%
mutate(diagnosis = str_replace(diagnosis, "AAA", "BAA")) %>%
count(diagnosis)One of the things you can do with your data in R is to display them as tables. Imagine this in a supplementary of a paper, summarising your study cohort.
library(tidyverse)
library(gtsummary)
# if its your first time running this, you will need to run install.packages("gtsummary")
aa_data %>%
select(-bmi) %>%
tbl_summary()| Characteristic | N = 9001 |
|---|---|
| diagnosis | |
| AAA | 300 (33%) |
| CTRL | 300 (33%) |
| TAA | 300 (33%) |
| abdominal_diameter | 2.11 (1.94, 3.44) |
| thoracic_diameter | 2.10 (1.96, 3.79) |
| age | 62 (53, 71) |
| male | 608 (68%) |
| hypertension | 346 (38%) |
| cad | 109 (12%) |
| artheriosclerosis | 360 (40%) |
| 1 n (%); Median (Q1, Q3) | |
As a first shot, this does not look bad. What you see here is a summary table created using the gtsummary package, more specifically the tbl_summary() function.
The diameter columns are treated as continuous variables, as is the age. The logical (0 / 1) columns of age, hypertension, cad, and artheriosclerosis are counted with a percentage indicating their relative occurrence. The labels do need a bit of work.
We can specify the labels using the labels argument and supplying a list of the structure column_name ~ "Text you want".
aa_data %>%
select(-bmi) %>%
tbl_summary(label = list(
abdominal_diameter ~ "Abdominal diameter",
thoracic_diameter ~ "Thoracic diameter",
age ~ "Age",
male ~ "Male",
hypertension ~ "Hypertension",
cad ~ "Coronary artery disease",
artheriosclerosis ~ "Artheriosclerosis"
))| Characteristic | N = 9001 |
|---|---|
| diagnosis | |
| AAA | 300 (33%) |
| CTRL | 300 (33%) |
| TAA | 300 (33%) |
| Abdominal diameter | 2.11 (1.94, 3.44) |
| Thoracic diameter | 2.10 (1.96, 3.79) |
| Age | 62 (53, 71) |
| Male | 608 (68%) |
| Hypertension | 346 (38%) |
| Coronary artery disease | 109 (12%) |
| Artheriosclerosis | 360 (40%) |
| 1 n (%); Median (Q1, Q3) | |
That looks a lot better. However, we would like to split the cohort by their diagnosis, indicating any differences between them. For that we are using the by = diagnosis argument.
aa_data %>%
select(-bmi) %>%
tbl_summary(by = diagnosis,
label = list(
abdominal_diameter ~ "Abdominal diameter",
thoracic_diameter ~ "Thoracic diameter",
age ~ "Age",
male ~ "Male",
hypertension ~ "Hypertension",
cad ~ "Coronary artery disease",
artheriosclerosis ~ "Artheriosclerosis"
))| Characteristic | AAA N = 3001 |
CTRL N = 3001 |
TAA N = 3001 |
|---|---|---|---|
| Abdominal diameter | 3.95 (3.44, 4.59) | 2.00 (1.88, 2.11) | 1.98 (1.83, 2.11) |
| Thoracic diameter | 2.02 (1.94, 2.09) | 2.00 (1.84, 2.12) | 4.02 (3.79, 4.18) |
| Age | 62 (53, 71) | 62 (52, 71) | 62 (54, 71) |
| Male | 229 (76%) | 148 (49%) | 231 (77%) |
| Hypertension | 150 (50%) | 67 (22%) | 129 (43%) |
| Coronary artery disease | 53 (18%) | 25 (8.3%) | 31 (10%) |
| Artheriosclerosis | 91 (30%) | 157 (52%) | 112 (37%) |
| 1 Median (Q1, Q3); n (%) | |||
The order of the columns is a bit harder than just fixing the labels, as we haven’t covered the forcats package yet. What you need to know is that the diagnosis column is treated as factor (see data structures above). By default, the levels are alphabetically, namely “AAA” before “CTRL”. We are using the fct_relevel() to change the default order, setting “CTRL” as the first level.
aa_data %>%
select(-bmi) %>%
mutate(diagnosis = fct_relevel(diagnosis, "CTRL")) %>% # releveling using the forcats package
tbl_summary(by = diagnosis,
label = list(
abdominal_diameter ~ "Abdominal diameter",
thoracic_diameter ~ "Thoracic diameter",
age ~ "Age",
male ~ "Male",
hypertension ~ "Hypertension",
cad ~ "Coronary artery disease",
artheriosclerosis ~ "Artheriosclerosis"
))| Characteristic | CTRL N = 3001 |
AAA N = 3001 |
TAA N = 3001 |
|---|---|---|---|
| Abdominal diameter | 2.00 (1.88, 2.11) | 3.95 (3.44, 4.59) | 1.98 (1.83, 2.11) |
| Thoracic diameter | 2.00 (1.84, 2.12) | 2.02 (1.94, 2.09) | 4.02 (3.79, 4.18) |
| Age | 62 (52, 71) | 62 (53, 71) | 62 (54, 71) |
| Male | 148 (49%) | 229 (76%) | 231 (77%) |
| Hypertension | 67 (22%) | 150 (50%) | 129 (43%) |
| Coronary artery disease | 25 (8.3%) | 53 (18%) | 31 (10%) |
| Artheriosclerosis | 157 (52%) | 91 (30%) | 112 (37%) |
| 1 Median (Q1, Q3); n (%) | |||
We can also further customize the table by adding things like pValues and italicizing the labels.
aa_data %>%
select(-bmi) %>%
mutate(diagnosis = fct_relevel(diagnosis, "CTRL")) %>% # releveling using the forcats package
tbl_summary(by = diagnosis,
label = list(
abdominal_diameter ~ "Abdominal diameter",
thoracic_diameter ~ "Thoracic diameter",
age ~ "Age",
male ~ "Male",
hypertension ~ "Hypertension",
cad ~ "Coronary artery disease",
artheriosclerosis ~ "Artheriosclerosis"
)) %>%
italicize_labels() %>%
add_p()| Characteristic | CTRL N = 3001 |
AAA N = 3001 |
TAA N = 3001 |
p-value2 |
|---|---|---|---|---|
| Abdominal diameter | 2.00 (1.88, 2.11) | 3.95 (3.44, 4.59) | 1.98 (1.83, 2.11) | <0.001 |
| Thoracic diameter | 2.00 (1.84, 2.12) | 2.02 (1.94, 2.09) | 4.02 (3.79, 4.18) | <0.001 |
| Age | 62 (52, 71) | 62 (53, 71) | 62 (54, 71) | 0.8 |
| Male | 148 (49%) | 229 (76%) | 231 (77%) | <0.001 |
| Hypertension | 67 (22%) | 150 (50%) | 129 (43%) | <0.001 |
| Coronary artery disease | 25 (8.3%) | 53 (18%) | 31 (10%) | 0.001 |
| Artheriosclerosis | 157 (52%) | 91 (30%) | 112 (37%) | <0.001 |
| 1 Median (Q1, Q3); n (%) | ||||
| 2 Kruskal-Wallis rank sum test; Pearson’s Chi-squared test | ||||
In this session, we have covered a lot of ground, and we were moving fast.
We have
read in data using readr functions, including read_tsv()
we have gone over the basics of data wrangling: mutate() to add columns, filter() to filter observations, group_by() and summarise() to compute statistics
on the more complicated end, we have merged data frames and transformed wide data into long data using pivot_longer(), and the reverse using pivot_wider().
We have covered the generation of summary tables, a niche skill but a welcome addition to many papers.
I encourage you to go back over this document and the slides and, most importantly, to go over the practice questions below. You are acquiring a skill and that comes with practice. These drier bits are the necessary ground work for us to dive into visualization, mean separatation and statistics next session.
The following are questions for you to practise what you learned so far. Feel free to go back through the document and the slides, but I would advise against using something like an LLM - this is where the learning happens.
E1: Use the code below to generate a vector called words. Access the third value of words.
words <- c("This", "is", "so", "so", "great")E2: Use the length() function to get to the length of the vector. What is its length?
E3: What class is words?
Refer to the github and download the data from the data folder. As a help you can check out the the basic_R.html file, download and open it.
E3: Read in the aneurysm survey data. How many rows does it have? How many columns? Use the colnames() function to receive the column names.
E4: Add a column to the data frame containing a specific label for each patient. Which patient is the oldest? Which patient has the smallest thoracic diameter? Does that patient also have hypertension?
E5: Using the same data, how many patients have a thoracic diameter larger than 3.5 cm? Hint: use the filter() function
E6: We do not need that much precision in the diameters we have. Round them to two decimal points.
E7: We only want to know about potential comorbidities in our patients, so create an identifier per patient and de-select the diameter columns. How many columns remain?
E8: Using the data frame generated above, subset our survey; arteriosclerosis is actually an exclusion criterion, so remove all patients with arteriosclerosis. How many patients remain?
M1: What is the minimal aortic diameter in the AAA condition?
M2: What is the mean abdominal diameter in the data set?
M3: What is the mean abdominal diameter by diagnosis? Considering only patients older than 60, how does this change?
M4: If you consider the median instead of mean and thoracic diameter instead of abdominal, what are the results?
M5: The median is oftentimes used in conjunction with the interquartile range (IQR). Can you use the quantile() function to get to Q1 (25%) and Q3 (75%) values for BMI in our cohort (use ?quantile for more information)? Hint: You will need to use the probs argument in quantile().
H1: Refer back to the github and read in the expression data set. Merge them in the same fashion as above.
H2: Consider the gene CXCL8. What is the mean expression in the data set? What is the mean expression per treatment? What is the mean expression by treatment and cell line? Which sample shows the highest expression of CXCL8?
H2a: CXCL8 is the gene name coding for the protein IL8. For your table, you want to replace the gene name with the protein name. Hint: use str_replace() and mutate()
H3: Generate an overview table of the aneurysm survey data using gtsummary.
H4: Split the table by diagnosis.
H5: Subset your study data to only include patients over the age of 50 and with a BMI larger than 22. Regenerate the table.
H6: Going back to the original survey data; the 0 / 1 notation for comorbidities is clear and simple, but we might something more expressive. Replace the 1s in the hypertension column with “present”, the 0s with “absent”. Hint: Use the ifelse() function.
H7: In the easy questions, we rounded the diameter columns to 2 decimal points. Can you round them to no decimal points? Look into across() using ?across(). Can you round both of them at the same time using across()?
H8: Building on the previous answers, can you make use of it to perform the same replacement as in H6 in all comorbidity columns at the same time?